Learning to Explore with Meta-Policy Gradient
Abstract
The performance of off-policy learning, including deep Q-learning and deep deterministic policy gradient (DDPG), critically depends on the choice of the exploration policy. Existing exploration methods are mostly based on adding noise to the ongoing actor policy and can only explore local regions close to what the actor policy dictates. In this work, we develop a simple meta-policy gradient algorithm that adaptively learns the exploration policy in DDPG. Our algorithm trains flexible exploration behaviors that are independent of the actor policy, yielding a global exploration that significantly speeds up learning. Through an extensive study, we show that our method significantly improves the sample efficiency of DDPG on a variety of reinforcement learning tasks.
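To make the structure concrete, here is a minimal sketch of the meta-policy-gradient loop on a toy one-step problem. The reward function, the quadratic "critic" fit, and all hyperparameters are illustrative assumptions standing in for a full DDPG learner; only the overall shape mirrors the approach above: an actor-independent Gaussian exploration policy is trained by REINFORCE, using the actor's improvement as the meta-reward.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(a):
    """Toy one-step 'environment' (an illustrative stand-in for an RL task)."""
    return -(a - 2.0) ** 2 - 0.05 * (a - 2.0) ** 4

theta = 0.0                    # deterministic actor parameter: action = theta
mu, log_std = 0.0, 0.0         # learned exploration policy N(mu, exp(log_std)^2)
actor_lr, meta_lr = 0.1, 0.02  # illustrative step sizes

for it in range(300):
    r_before = reward(theta)                  # actor performance before update
    # 1) Collect data with the exploration policy, which is independent of the
    #    actor, instead of perturbing the actor's own action with local noise.
    std = np.exp(log_std)
    acts = mu + std * rng.standard_normal(64)
    rews = reward(acts)
    # 2) Stand-in for the DDPG inner update: fit a quadratic "critic" to the
    #    explored data, then step the actor along the critic's gradient.
    c2, c1, _ = np.polyfit(acts, rews, 2)
    theta += actor_lr * (2.0 * c2 * theta + c1)
    # 3) Meta-reward: how much this batch of exploration improved the actor.
    meta_r = reward(theta) - r_before
    # 4) REINFORCE on the exploration policy's parameters with that meta-reward.
    d_mu = meta_r * np.mean((acts - mu) / std ** 2)
    d_log_std = meta_r * np.mean((acts - mu) ** 2 / std ** 2 - 1.0)
    mu += meta_lr * d_mu
    log_std += meta_lr * d_log_std

print(f"actor action {theta:.2f} (optimum 2.0), exploration mean {mu:.2f}")
```

The key design point is step 3: the exploration policy is rewarded not for the returns it collects itself, but for how much the data it gathers improves the actor.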
Similar Resources
Fast Online Policy Gradient Learning with SMD Gain Vector Adaptation
Reinforcement learning by direct policy gradient estimation is attractive in theory but in practice leads to notoriously ill-behaved optimization problems. We improve its robustness and speed of convergence with stochastic meta-descent, a gain vector adaptation method that employs fast Hessian-vector products. In our experiments the resulting algorithms outperform previously employed online sto...
SIZING OPTIMIZATION OF TRUSS STRUCTURES WITH NEWTON META-HEURISTIC ALGORITHM
This study is devoted to discrete sizing optimization of truss structures, employing an efficient discrete evolutionary meta-heuristic algorithm that uses the Newton gradient-based method as its updating scheme; it is named here the Newton Meta-heuristic Algorithm (NMA). In order to enable the NMA population-based meta-heuristic to effectively explore the discrete design space, a term contain...
Equivalence Between Policy Gradients and Soft Q-Learning
Two of the leading approaches for model-free reinforcement learning are policy gradient methods and Q-learning methods. Q-learning methods can be effective and sample-efficient when they work; however, it is not well understood why they work, since empirically the Q-values they estimate are very inaccurate. A partial explanation may be that Q-learning methods are secretly implementing policy g...
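For reference, the correspondence at issue can be stated in standard entropy-regularized notation (our summary, not the paper's text). With temperature $\tau$, the Boltzmann policy induced by a soft Q-function is

\[
\pi(a \mid s) = \exp\!\big((Q(s,a) - V(s))/\tau\big),
\qquad
V(s) = \tau \log \sum_{a'} \exp\!\big(Q(s,a')/\tau\big),
\]

and the paper's result is that, roughly, the soft Q-learning update coincides in expectation with a policy-gradient update for this policy plus a value-function (baseline) fitting term.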
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is to train a model on a variety of learning tasks, such that it can solve new learning tasks using...
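As a concrete illustration of the inner/outer structure, here is an exact MAML loop for a toy family of scalar linear-regression tasks. The task family, step sizes, and single-step inner adaptation are illustrative assumptions; the method itself applies to any model trained with gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.05, 0.01   # inner / outer step sizes (illustrative)
theta = 0.0                # meta-learned initialization of the scalar weight

def task_batch(w, n=20):
    """One regression task y = w*x with a fixed slope w."""
    x = rng.uniform(-1.0, 1.0, n)
    return x, w * x

for step in range(2000):
    meta_grad = 0.0
    slopes = rng.uniform(-2.0, 2.0, 8)           # sample a batch of tasks
    for w in slopes:
        x_tr, y_tr = task_batch(w)
        x_va, y_va = task_batch(w)
        # Inner step: one gradient step on the task's training split.
        g_tr = 2.0 * np.mean(x_tr * (theta * x_tr - y_tr))
        h_tr = 2.0 * np.mean(x_tr ** 2)          # Hessian is a scalar here
        theta_i = theta - alpha * g_tr
        # Outer gradient: differentiate the validation loss through the inner
        # update, d L_va(theta_i)/d theta = L_va'(theta_i) * (1 - alpha * H).
        g_va = 2.0 * np.mean(x_va * (theta_i * x_va - y_va))
        meta_grad += g_va * (1.0 - alpha * h_tr)
    theta -= beta * meta_grad / len(slopes)

print("meta-learned init:", theta)  # ~0, equally adaptable to all slopes
```

Because the model is a scalar, the second-order term (1 - alpha * H) can be computed exactly rather than approximated.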
Evolved Policy Gradients
We propose a meta-learning approach for learning gradient-based reinforcement learning (RL) algorithms. The idea is to evolve a differentiable loss function such that an agent that optimizes its policy to minimize this loss will achieve high rewards. The loss is parametrized via temporal convolutions over the agent’s experience. Because this loss is highly flexible in its ability to take in...
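A minimal sketch of that outer-evolution / inner-optimization structure, under strong simplifying assumptions: a hand-picked two-parameter loss family and a one-parameter "policy", with a score-based evolution strategy standing in for the outer loop.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_return(theta):
    """The real task objective; the agent never differentiates this directly."""
    return -(theta - 3.0) ** 2

def inner_train(phi, steps=50, lr=0.2):
    """Agent minimizes the evolved loss L_phi(theta) = phi[1]*(theta - phi[0])**2."""
    theta = 0.0
    for _ in range(steps):
        theta -= lr * 2.0 * phi[1] * (theta - phi[0])  # d L_phi / d theta
    return theta

phi = np.array([0.0, 1.0])     # parameters of the loss (illustrative family)
sigma, outer_lr, pop = 0.1, 0.05, 32

for gen in range(200):
    eps = rng.standard_normal((pop, 2))
    fits = np.array([true_return(inner_train(phi + sigma * e)) for e in eps])
    fits = (fits - fits.mean()) / (fits.std() + 1e-8)  # standardized fitness
    phi += outer_lr / (pop * sigma) * eps.T @ fits     # ES gradient estimate
    phi[1] = max(phi[1], 0.05)  # keep the evolved loss a well-posed convex bowl

print("evolved loss pulls the policy toward:", round(phi[0], 2))  # should approach 3.0
```

The inner learner never sees the true return; evolution shapes the surrogate loss so that minimizing it happens to maximize the return.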